Process data
- The data are relatively tidy, with few missing values. We also add the longitude and latitude of each urban area, obtained by geocoding the city names with an R package and cached in data/coord.csv.
library(ggplot2)
library(readxl)
library(dplyr)
data1 = read_excel("data/Changes in Urban Land, Population and Density by Country.xlsx", sheet = 'Countries', skip=2)
data1 = data1[-c(1,20,21,22),-16]
data1$Country[c(4,7,13,15)] = c("North Korea", "Laos", "South Korea", "Taiwan")
country_name = data1$Country
data2 = read_excel("data/Urban Areas with Populations Greater Than 100,000 People.xlsx", sheet = 'Urban_Areas_by_pop', skip = 2)
data2 = data2 %>% na.omit()
data2 = data2[-1]
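# Relabel countries with the cleaned names in country_name.
# levels<- assigns by position, so this assumes the sorted levels of
# data2$Country line up one-to-one with the order of country_name.
temp = factor(data2$Country)
levels(temp) = country_name
data2$Country = as.character(temp)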
city_name = gsub(" urban area", "", data2$`Urban Area Name`)
city_name = gsub("'", " ", city_name)
city_name = paste(city_name, data2$Country, sep=", ")
#coord = geocode(city_name, messaging = F)
#write.csv(coord, file = "EDA_Project/data/coord.csv", row.names=FALSE)
coord = read.csv("data/coord.csv")
data2$lon = coord$lon
data2$lat = coord$lat
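The country relabelling above works by position, which only holds when the sorted factor levels and country_name are in the same order. A minimal sketch of an alternative that recodes by name instead, on toy data: the `lookup` vector and the original spellings in it are hypothetical and are not taken from the spreadsheets.

```r
# Hypothetical lookup: original spelling -> cleaned name (illustrative values only)
lookup <- c("Korea, Rep." = "South Korea", "Lao PDR" = "Laos")

# Toy column standing in for data2$Country
country <- c("Japan", "Lao PDR", "Korea, Rep.", "Japan")

# Replace only the names present in the lookup; leave the rest untouched
matched <- country %in% names(lookup)
country[matched] <- unname(lookup[country[matched]])
country
```

Recoding by name leaves unmatched countries as they are, so a change in row order or a missing country cannot shift labels onto the wrong rows.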
Variable 1: Country
# Country is categorical, so geom_bar() (which counts rows by default) is used
g=ggplot(data2,aes(Country))+
geom_bar(fill="skyblue", color="white")+
ggtitle("Number of urban areas in each country")+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
g

- The bar chart of the categorical variable Country shows the number of urban areas each country contributes to the data. Country is a key variable for this project: the plot shows how urban areas are distributed across countries and gives a rough sense of each country's scale in population and land area. China has far more urban areas than any other East Asian country. In terms of data quality, this means the data cannot be treated as representing East Asia as a single homogeneous entity; we need to analyze by country, with particular attention to China. To show the counts for the other countries on a readable scale, we exclude China in the next chart.
no_china=data2[which(data2$Country!="China"),]
# Same bar chart with China excluded so the remaining countries are readable
g=ggplot(no_china,aes(Country))+
geom_bar(fill="skyblue", color="white")+
ggtitle("Number of urban areas in each country, excluding China")+
theme(axis.text.x = element_text(angle = 30, hjust = 1))
g

- In this chart, Indonesia ranks second overall (after China) and Japan third in the number of urban areas. They lead the remaining countries, but with far fewer urban areas than China. Some countries, such as Brunei, have only a few urban areas in our dataset because their urban land and urban population are so small. Such countries will therefore have little influence on our analysis of East Asian urbanization in general.
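The ranking read off the charts can also be checked numerically. A small sketch using base R on made-up data; the `toy` data frame is an assumption standing in for data2 and its counts are not the real ones.

```r
# Toy data standing in for data2 (values are made up)
toy <- data.frame(
  Country = c("China", "China", "China", "Japan", "Japan", "Brunei"),
  stringsAsFactors = FALSE
)

# Count rows per country and sort from most to fewest urban areas
counts <- sort(table(toy$Country), decreasing = TRUE)
counts

# Share of all urban areas contributed by each country
round(prop.table(counts), 2)
```

A sorted table makes the ordering explicit and gives each country's share, which a bar chart with one dominant bar makes hard to judge by eye.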
Variable 2: Average annual rate of increase in urban land 2000 - 2010 (%)
g=ggplot(data2,aes(`Average annual rate of increase in urban land 2000 - 2010 (%)`))+
geom_histogram(fill="skyblue", color="white")+
ggtitle("Histogram of Average annual rate of increase in urban land 2000 - 2010")
g

ggplot(data2, aes(x="",y=`Average annual rate of increase in urban land 2000 - 2010 (%)`))+
geom_boxplot(outlier.colour="red", outlier.shape=16,
outlier.size=2, notch=T)

- The histogram shows that urban land expanded across East Asia from 2000 to 2010. Most urban areas grew at an average annual rate between 0% and 5%, with values concentrated near 0%. Few urban areas grew faster than 10% per year, and no urban area shrank. This suggests that expanding urban land is a necessary part of urbanization, even if the rate is relatively low. In terms of data quality, this variable is suitable for our analysis despite a few outliers. The boxplot shows that all outliers lie on the high side. We can pull those outliers out and examine them individually to see why their rates are so high.
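One way to pull out the high-side outliers for individual inspection is sketched below on made-up rates. It uses `boxplot.stats()` from base R, which applies a 1.5 * IQR rule based on Tukey's hinges; ggplot2 computes its box from quantiles, so the flagged points may differ slightly from the red points in the boxplot. The `toy` data frame and its column names are assumptions.

```r
# Toy growth rates in percent (made-up values, not the project data)
toy <- data.frame(
  area = paste0("area", 1:8),
  rate = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 15.0)
)

# Values lying more than 1.5 * IQR beyond the hinges
out_values <- boxplot.stats(toy$rate)$out

# Keep the full rows so each outlier can be examined with its other variables
outliers <- toy[toy$rate %in% out_values, ]
outliers
```

Keeping whole rows rather than just the values makes it possible to look at the country, population, and land figures of each flagged urban area.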
Variable 3: Average annual rate of change of urban population (%)
g=ggplot(data2,aes(`Average annual rate of change of urban population (%)`))+
geom_histogram(fill="skyblue", color="white")+
ggtitle("Histogram of Average annual rate of increase in urban population")
g

ggplot(data2, aes(x="",y=`Average annual rate of change of urban population (%)`))+
geom_boxplot(outlier.colour="red", outlier.shape=16,
outlier.size=2, notch=T)

- This histogram shows that from 2000 to 2010 most urban areas in East Asia gained urban population, while a few lost population. The boxplot shows outliers on both sides, but they seem reasonable, since some urban areas grow much faster than others. The range of the rate of change is acceptable and the distribution is roughly normal, so this variable is good to use in the analysis.
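The "roughly normal" reading is a visual judgment. A minimal sketch of a numeric check, run here on simulated rates rather than the project data; the sample size, mean, and standard deviation are arbitrary assumptions.

```r
set.seed(1)                           # reproducible toy data
rate <- rnorm(200, mean = 3, sd = 2)  # made-up rates, not the project data

# Shapiro-Wilk test: a small p-value is evidence against normality
res <- shapiro.test(rate)
res$p.value

# A mean close to the median is a quick check for symmetry
c(mean = mean(rate), median = median(rate))
```

With many observations the test can reject normality for deviations that do not matter in practice, so it is best read alongside the histogram or a normal Q-Q plot (`qqnorm()` and `qqline()`).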
Variable 4: Urban expansion per additional urban inhabitant (sq. m./ person)
g=ggplot(data2,aes(`Urban expansion per additional urban inhabitant (sq. m./ person)`))+
geom_histogram(fill="skyblue", color="white")+
ggtitle("Histogram of Urban expansion per additional urban inhabitant (Original)")
g

ggplot(data2, aes(x="",y=`Urban expansion per additional urban inhabitant (sq. m./ person)`))+
geom_boxplot(outlier.colour="red", outlier.shape=16,
outlier.size=2, notch=T)

- Both the histogram and the boxplot of Urban expansion per additional urban inhabitant (sq. m./ person) show several outliers. One outlier has an extremely negative value, which compresses both plots and makes them hard to read. Other outliers deviate less from the bulk of the data; we exclude them as well to see the distribution more clearly.
# Keep values strictly between -1000 and 2000 sq. m. per person
expansion = data2$`Urban expansion per additional urban inhabitant (sq. m./ person)`
most = data2[expansion > -1000 & expansion < 2000,]
g=ggplot(most,aes(`Urban expansion per additional urban inhabitant (sq. m./ person)`))+
geom_histogram(fill="skyblue", color="white",binwidth = 50)+
ggtitle("Histogram of Urban expansion per additional urban inhabitant (modified)")
g

- With values below -1000 and above 2000 removed, this histogram gives a much clearer view of the distribution of Urban expansion per additional urban inhabitant from 2000 to 2010. Most urban areas in East Asia added roughly 0 to 300 sq. m. of urban land per additional urban inhabitant. Some urban areas still have negative values. This is consistent with the histograms of land growth and population change above: in those areas, urban land increased while population decreased.
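The sign of this variable can be illustrated with a small worked example. The sketch assumes the variable is defined as added urban land (in sq. m.) divided by added urban inhabitants; that definition, the `toy` data frame, and its column names are assumptions, not taken from the source spreadsheet.

```r
# Toy areas (made-up numbers): land change in sq. km, population change in persons
toy <- data.frame(
  area = c("growing", "shrinking"),
  land_change_sqkm = c(10, 5),
  pop_change = c(50000, -2000)
)

# Assumed definition: added urban land in sq. m. divided by added inhabitants
toy$expansion_per_person <- toy$land_change_sqkm * 1e6 / toy$pop_change
toy
```

Under this assumed definition, an area whose land grows while its population falls gets a negative value, and a population change close to zero produces a very large magnitude, which would explain extreme outliers on either side.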